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(linear  and  nonlinear)  least  squares,  robust  regression,  logistic  regression,  and 
Poisson  regression.  These  problems  have  a  common  structure  which  may  often  be 
worth  exploiting,  especially  in  cases  where  r(x)  is  a  nonlinear  function  of  x. 
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either  large  or  small.  Appendices  consider  the  numerical  linear  algebra 
involved  in  computing  a  trial  change  to  the  parameter  vector  x  and  give  a  new 
perturbation  theorem  on  linear  least  squares  that  is  useful  in  analyzing  the 
numerical  behavior  of  the  procedure  recommended  for  computing  this  trial  change. 
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SIGNIFICANCE  AND  EXPLANATION 


\ 

—‘Many  researchers  employ  mathematical  models.  Most  models  contain  param¬ 
eters,  which  may  be  chosen  to  make  the  model  fit  the  available  data  as  well  as 
possible  (in  a  sense  that  depends  on  the  model) .  In  this  paper  we  consider  the 
problem  of  choosing  the  parameters  for  a  common  class  of  models  in  which  the 
desired  parameter  vector  a*  minimizes  an  (unconstrained)  objective  function, 

of  the  form  T.  p.(r.(x)),  where  r.  is  the  ith  (generalized)  residual  af^ 
‘-111  1  _  . _ - 

the  model  and  is  a  scalar  criterion  function.  '  (Often  r^  is  the  model's 

error  at  the  ith  observation.)-^  We  briefly  give  some  examples  of  such 
problems,  then  discuss  ways  to  exploit  the  common  structure  that  these  problems 
share.  This  leads  us  to  discussing  strategies  for  solving  general  unconstrained 
minimization  problems  and  to  point  out  the  advantages  of  using  a  so-called 
"model/trust-region  approach,’'  wherein  the  change  made  in  the  current  parameter 
estimate  is  chosen  so  as  to  approximately  minimize  a  local  model  of  the  objective 
function  on  an  estimate  of  the  region  about  the  current  iterate  where  this  local 
model  is  reliable.  For  problems  in  which  the  residual  vector  r(x)  is  a  non¬ 
linear  function  of  x,  we  recommend  generalizations  of  some  techniques  that 
have  proven  worthwhile  in  nonlinear  least-squares  problems  in  which  the  optimal 


resid  lal  vector  r(x*)  may  be  either  large  or  small. 


The  responsibility  for  the  wording  and  views  expressed  in  this  descriptive 
summary  lies  with  MRC,  and  not  with  the  author  of  this  report. 


ON  SOLVING  ROBUST  AND  GENERALIZED  LINEAR  REGRESSION  RRCbLENr* 
David  M.  Gay 


1.  Introduction 

A  number  of  regression  problems  require  finding  a  parameter  vector  ;•;*  v.:,icr, 
minimizes  an  objective  function  •f  :  JR*'*  -*•  IR  of  the  form 

n 

(1.1)  V!(x)  =  y  p.  (r.  (x))  , 

i=l  ^  ^ 

where  r^  :  ]R^^  “*■  IR.  IR  -<■  K,  and  n  ^  p.  Often  r(x)  :  =  (r^(>;),  '  '  ‘  ,  r_ 

is  the  residual  vector  corresponding  to  a  linear  or  nonlinear  model;  we  shall  t:..  i  f  r 
call  it  the  (generalized)  "residual  vector"  even  in  cases  where  minir.izinc;  i  is  t.' • 
intended  to  malre  r(x*)  near  zero.  In  the  case  of  robust  regression,  is  -  .r'l.  - 

times  called  a  robust  loss  function.  But  "loss"  has  other  meanings,  so  we-  shall  r-  f-  .' 
to  as  the  ith  criterion  function. 

Problems  of  the  form  (1.1)  have  in  common  a  structure  which,  we  suspect,  it  will 
often  prove  worthwhile  to  exploit,  especially  when  the  residual  vector  r(x)  is  a  ncr.- 
linear  function  of  x.  After  giving  examples  of  (1.1)  in  the  next  section,  wv  ri.striit 
our  attention  in  §3  to  the  common  case  where  r(x)  is  an  affine  function  of  x.  Ti.is 
gives  us  the  opportunity  to  discuss  some  ideas  applicable  to  genera]  unconstrained 
optimization  and,  in  particular,  to  point  out  the  advantages  of  a  model/trust-regior. 
approach.  In  §4  we  turn  to  the  general  case  where  r(x)  is  nonlinear  and  recommend 
trying  the  obvious  generalizations  of  some  ideas  that  have  proven  wortliwhile  for  non¬ 
linear  least  squares.  Appendix  A  deals  with  the  numerical  linear  algebra  of  solving 
the  special  linear  least-squares  problems  that  arise  when  computing  certain  "Newton" 
steps,  and  Appendix  B  states  and  proves  a  perturbation  theorem  used  in  Ap'pendix  A. 
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2.  Exami'los 


t'ernaps  the  most  comition  piroblem  of  the  form  (1.1)  is  the  least-squares  problem, 

1  2 

in  '.■.•Jiici!  .  “  T  ^  piroblem  may  arise,  for  example,  when  one 

has  a  model  f^(x)  that  predicts  what  experimental  response  will  be  obtained  under 

the  ith  set  of  conditions,  and  one  measures  response  under  these  conditions; 

if  there  are  independent,  normally  distributed  errors  in  the  y^,  then  the  maximum- 

likelihood  estimate  x*  for  the  model  parameters  minimizes  (1.1)  with  r^(x)  = 

f ,  (x)  -  y  .  and  p  .  (T )  =  i  - . 

11  1  2 

The  least-squares  parameter  estimate  can  be  strongly  influenced  by  errors  in  the 
data  y^,  so  a  number  of  so-called  "robust  regression"  techniques  have  been  proposed 
for  obtaining  good  estimates  of  the  model  parameters  in  cases  where  some  of  the  y^ 
contain  large  errors.  One  such  approach  involves  solving  (1.1)  with  p.  a  robust 


criterion  function  and  r^(x)  =  (f^(x)  -  y.)/a,  where  o  is  a  scale  parameter  that 
has  been  determined  by  some  other  means.  A  number  of  robust  criterion  functions  have 
been  proposed,  such  as  the  eight  choices  considered  by  Holland  and  Welsch  in 
[H01W771.  These  include  the  criterion  functions  of  Huber  [Hub641, 


of  Fair  lFai741 


I  if  1t|  <  H 


(ill  -  H/2)H  if  |t I  >  H 


(T)  =  F[|t|  -  F  •  ln(l  +  |t|/F)1 


of  '.velsch  |DenK78], 


P^(t)  =jW^[1  -  exp(-T^/W^)l  , 


and  of  Hinich  and  Talwar  li)inT75]  , 


p  .  (t)  = 


if  T  <  T 


T  /2  if  IT  1  >  T  . 


-2- 


On  probifins  with  small  residuals,  all  of  these  cirterion  functions  behave  very  much 


like  the  least-squares  criterion  function.  The  tuning  constants  H,  F,  W,  T  of 


(2.1-'4)  may  be  chosen  to  make  (1.1)  yeild  parameter  estimates  with  a  specified  level 


of  efficiency  -  see  [HolW77] 


The  generalized  linear  models  of  Nelder  and  Wedderburn  [NelW72]  give  rise  to  other 


problems  of  the  form  (1.1).  In  addition  to  the  least-squares  problem,  these  include 


The  logistic  regression  problem  arises  if  one  models  the  probability  tt  .  of 


success  in  an  experiment  under  the  ith  set  of  conditions  by  Tr./d  “ 


V.  times  under  these  conditions  and  o 


occur,  then  (under  reasonable  assumptions)  the  probability  of  the  observed  outcomes  is 


and  choosing  x*  to  maximize  (the  logarithm  of)  this 


probability  amounts  to  minimizing  (1.1)  with 


observes  n  indepjendent  Poisson  processes  and  obtains  a  count  of  v.  for  the  ith 


process,  an  event  of  probability  e 


If  one  models  X 


then 


finding  a  maximum  likelihood  estimate  x*  for  x  amounts  to  minimizing  (1.1)  with 


With  the  excep^tion  of  (2.4),  all  of  the  above  criterion  functions  are  continuously 


differentiable.  The  discontinuity  in  the  p!  of  (2.4)  causes  no  serious  trouble  for 


the  iterative  schemes  considered  below,  at  least  so  long  as  r(x)  is  continuously 


differentiable,  since  (as  is  easily  seen) ,  points  of  discontinuity  are  not  points  of 


attraction  for  these  schemes.  The  discontinuity  in  the  Huber  pV  (2.1)  causes  no 


problems  cither,  so  we  will  refrain  from  further  discussion  of  the  continuity  of  the 


criterion  functions  and  will  assume  them  to  be  at  least  twice  differentiable  in  what 


3.  Linear  Problems 


Often  when  the  problems  sketched  in  §2  arise,  r(x)  is  a  linear  or  affine 


(3.1)  r (x)  =  Ax  -  b  , 

nxp  n 

where  A  e  ]R  (i.e.,  A  is  an  n  x  p  matrix)  and  b  e  IR  .  In  this  case,  the  cradle 

and  Hessian  of  have  particularly  simple  forms: 

(3.2)  Vs?  (x)  =  a'^  p'(r(x))  and 

(3.3)  s;(x)  =  a'^  P(p"(r(x)))A  , 

where  p'(r(x))=  [p'(r  (x)),  •  •  ■  ,  p'(r  (x))]'^  and  P(p"(r(x)))  =  diag  (c "  (r  (x) )  , 
11  n  n  11 

T 

•  •  •  ,p"(r  (x) ) )  is  the  diagonal  matrix  whose  ith  diagonal  element  is  ;'.'(r.(x)). 
n  n  ’  11 

Since  (3.3)  has  such  a  simple  form,  it  is  reasonable  to  consider  using  the  dampe 
Newton ' s  method 

(3.4)  =  x*"  -  v!(x'^)"^  V^(x'") 

to  construct  a  sequence  of  iterates  which,  under  reasonable  conditions,  converge  to 

a  (possibly  local)  minimizer  x*  of  (1.1).  In  (3.4),  is  a  step  length  paranete 

1  )c  )c 

chosen  to  assure  v?  (x  )  <  ■;  (x  )  when  V14  (x  )  ^  0.  It  is  unnecessary  to  choose 

k+1 

as  to  (nearly)  minimize  >4  (x  );  indeed,  it  is  nowadays  generally  recognized  as 

k  k+1 

inefficient  to  attempt  this.  But  it  is  important  to  make  •;  (x  )  -  ■;  (x  )  large 

enough  that  the  iterates  do  not  converge  to  a  noncritical  point  of  .  See  [Fow71], 

[GilM74],  and  §6.3  of  [DenS791  for  discussion  of  efficient  ways  to  do  this. 

If  p^  is  the  least-squares  criterion  function,  A  has  rank  p,  and  =  1, 

then  (3.4)  converges  in  one  iteration  (when  exact  arithmetic  is  used).  In  this  case, 

T  -IT 

(3.4)  amounts  to  the  normal  equations:  x*  =  (A  A)  A  b.  When  finite-precision 

T  T  T  T 

arithmetic  is  used,  explicitly  computing  A  A  and  A  b  and  solving  A  A  x*  =  A 


works  well  if  A  is  well  conditioned,  i.e.,  if  the  condition  number 
T  T  -1  1/2 

K  =  (li  A  All  II  (A  A)  II)  of  A  is  not  too  large.  For  large  values  of  <  it  is 


-4- 


usually  much  more  accurate  to  employ  a  QR  factorization  of  A,  i.c.  ,  to  factor 


as  QR, 

where 

Q  t 

has 

orthonormal 

columns  and  R  c  IR  is  upper  triangular 

and  then 

solve 

T 

R  X*  =  Q 

b  - 

see  [LawH74] 

2  3 

i.  (This  costs  roughly  n  p  -  1  /3 

multiplications  and  a  similar  number  of  additions,  versus  roughly  nr  /2  +  p  /6  for 
explicitly  solving  the  normal  equations,  so  using  the  QR  factorization  less  tnan 
doubles  the  arithmetic  overhead) . 

For  other  criterion  functions  having  >  0  for  all  i,  such  as  (2.2),  (2.5), 

and  (2.6),  we  can  readily  convert  the  task  of  computing  the  Newton  direction 
2  k  -1  k 

-V  (x  )  (x  )  into  that  of  solving  a  linear  least-squares  problem  by  setting 

w  k  1  /P 

(3.5a)  A  =  p(p"(r(x  ))  )A  and 

k  k  - 1 /2  k 

(3.5b)  b  =  p(p*’(r(x  ))  )p'(r(x  )) 

where  p(p"(r)-^^^)  =  diag (p '' (r  ) -^^^ ,  *  *  *  ,  p'*(r  )*^'^^)  ,  so  that  7"  v"  (x^)  = 

11  n  n 

kTk  kkTk  k 

(A  )  A  and  V'P(x  )  =  (A  )  b  .  Using  a  properly  computed  QR  factorization  of  A 

will  then  usually  lead  to  a  more  accurately  computed  value  for  the  Newton  direction. 

There  are  three  commonly  used  ways  to  compute  a  QR  factorization  for  use  in 
solving  linear  least-squares  problems:  triangularization  via  Householder  transforma¬ 
tions  (elementary  reflectors) ,  triangularization  via  (standard  or  fast)  Givens 
transformations  (plane  rotations) ,  and  the  stabilized  Gram-Schmidt  process  -  see 
(LawH74].  While  all  three  yield  similar  numerical  accuracy  on  a  broad  range  of 
problems,  use  of  the  first  two  techniques  on  (3.5)  can  sometimes  lead  to  a  severe  loss 
of  accuracy  in  cases  when  p"(r)  has  a  component  near  zero.  This  point  is  discussed 
in  detail  in  Appendix  A,  where  it  is  shown  that  the  stabilized  Gram-Schmidt  p'rocess 
does  not  share  this  drawbac)4. 

Some  cost  functions,  such  as  (2.3),  can  have  ppr^(x))  <  0  for  certain  i,  so 

2 

It  IS  possible  for  the  Newton  direction  -V  v’(x)V'^(x)  to  be  uphill,  nonexistent,  or 
nearly  orthogonal  to  the  gradient  Vi^  (x)  .  One  way  to  overcome  this  difficulty  is  to 
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2  k  k 

re{:  lace  V*”  (x  )  in  (3.4)  by  a  positive  definite  matrix  H  ,  so  that  (3.4) 

becomes 

a  ^+1  k  ,  /,,k  -1  „  ,  k, 

(3.b)  X  =  X  -  ^  VvJ(x  ) 

For  solving  robust  regression  problems,  Huber  (Hub75]  has  considered  choosing 
(3.7)  =  a'^  A  . 

Since  H  remains  the  same  for  all  k,  this  choice  lets  one  avoid  computing  new 
matrix  factorizations  after  the  first  iteration.  But  it  often  converges  slowly. 
Holland  and  Welsch  [HolW77J  advocate  using  an  H  that  is  generally  attributed  to 
Beaton  and  Tukey  lBeaT741: 


(3.8a) 

(3.8b) 


k  T  k 

H  =  A  P(w(r(x  )))A  ,  where 


w(r)  =  (D^(r^)/r^, 


,  p' (r  )/r  ) 
n  n  n 


This  converges  more  rapidly  than  (3.7)  but  more  slowly  than  Newton's  method  (near  the 


solution)  lHolW77].  (Since  p!(0)  =  0  for  robust  criterion  functions,  (3,8)  amounts 

1 

p^  (r .  )  -  p^  (0) 

to  replacing  DV(r.)  in  (3.3)  by  the  finite  difference  w.(r.)  =  — - - - -  >  0.) 

11  11  r.-O 


With  this  choice  of  H  ,  it  is  again  possible  to  compute  the  step  direction, 

k“lk  kTkk 

-(H  )  Vi;  (x  ),  by  linear  least  squares,  since  (x  )  =  A  p(w(r(x  )))r(x  ):  when 


done  with  =  1,  this  is  known  as  iteratively  reweighted  least-squares.  Byrd  and 

k  0 

Pyne  [ByrP79]  have  proven  the  remarkable  result  that  lim  (x  ]  =0  for  any  x 

k-«o 


with  this  approach.  Because  robust  regression  problems  are  likely  to  have  a  number  of 

local  minima,  Holland  and  Welsch  (HolW77J  recommend  starting  from  an  x^  that 
n 

minimizes  II  Ax  -  b  11^^  =  [  |  (Ax  -  b)  .  |  . 

i=l  ^ 

In  connection  with  general  unconstrained  minimization,  a  number  of  other  choices 

for  H  have  been  considered  in  the  literature.  Greenstadt,  for  instance,  has  proposed 

2  k 

replacing  negative  eigenvalues  of  V  (x  )  by  their  absolute  values  IGre671.  Murray 
has  nroposed  an  interesting  modified  Cholesky  factorization  (Mur72].  And  Mor^  and 
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Sorensen  {MorS77]  have  proposed  modifying  the  eigenvalues  of  the  block  diagonal  matrix 

2  X 

in  the  bunch-Parlett  (BunP71)  factorization  of  V  s?  (x  ) .  Unfortunately,  all  of  these 

k  -1  k 

schemes  can  generate  search  directions  •  (H  )  (x  )  which  are  nearly  orthogonal  to 

2  k 

t}ie  gradient  in  certain  cases  when  V  (x  )  is  poorly  conditioned.  This  can  cause 
slow  convergence. 

Secant  update  methods  (see  lDenM77],  where  they  are  called  quasi-Newton  methods) 

provide  another  way  to  generate  a  reasonable  positive  definite  substitute  H  for 
2  k 

7  (x  ) .  They  offer  the  advantage  of  less  arithmetic  overhead  per  iteration  than 

2 

methods  which  deal  explicitly  with  V  but  they  also  converge  more  slowly  than 

these  methods.  They  should  be  faster  than  (3.7),  and  at  this  writing  it  is  unclear 
how  they  fare  in  comparison  with  (3.8). 

Starting  with  Fletcher  and  Freeman  lFleF75)  and  McCormick  lMcC77) ,  there  has  been 

considerable  interest  of  late  in  exploiting  negative  curvature.  The  idea  is  to  replace 

k  -1  k 

the  single  search  direction  -{H  )  Vv?  (x  )  in  (3.6)  by  what  Sorensen  lSor77]  has 

k  k  k  k  —  1  k 

termed  a  descent  pair  (s  ,  d  )  ,  in  which  s  =  -(H  )  V'f  {a  )  for  some  positive 

k  kT2kk  2k 

definite  H  and  (d  )  V  )d  ^0,  with  equality  only  if  V  >,?  (x  )  is  positive 

k+1  k  k  k 

semidefinite.  In  such  a  scheme,  x  -  x  is  a  linear  combination  of  s  and  d  : 

2  k  k 

Goldfarb  (Gol77]  suggests  a  combination  of  the  form  a  s  +  a  d  ,  while  Mor^  and 

k  2  k  k 

Sorensen  (MorS77]  prefer  as  +  a  d  .  The  choice  of  s  recommended  in  fGol77] 
and  (:torS77j  can  be  nearly  orthogonal  to  the  gradient  in  some  cases  (and  the  higher 
the  arithmetic  precision,  the  more  nearly  orthogonal  s  and  the  gradient  can  be) , 
so  these  schemes  need  further  study. 

The  model/trust-region  approach  offers  a  more  elegant  way  of  dealing  with  cases 
2 

where  7  ^  may  not  be  sufficiently  positive  definite.  This  approach  can  generate 

k+1  k  T  2  k  k+1  k 

stops  in  a  direction  of  negative  curvature  (i.e.,  (x  -  x  )  V  (x  )  (x  -  x  )  <  0 
is  jjossiblc) ,  but  in  some  ways  it  is  simpler  than  the  schemes  that  deal  explicitly 
with  negative  curvature.  Moreover,  it  avoids  the  slow  convergence  that  can  result 
from  search  directions  nearly  orthogonal  to  the  gradient,  and  it  can  converge  faster 
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have  thu  form 


(3.10)  =  {s  e  :  II  Ds  II  <  , 

where  D  is  a  diaqonal  matrix  having  >  0.  While  choosing  D  =  I,  the  identity 

matrix,  works  well  on  well-scaled  problems,  a  propjer  choice  of  the  scale  matrix  L 
can  lead  to  significantly  faster  convergence  on  problems  where  the  components  of  x 
are  expressed  in  sharply  contrasting  units.  For  the  linear  piroblems  of  present  con¬ 
cern,  choosing  to  be  some  convenient  norm  of  the  ith  column  of  A  is  usually 

reasonable . 

If  the  norm  II  •  II  in  (3.10)  is  the  max-norm  (i.e.,  II  y  II  =  H  y  11^^  = 
max  {[y^l  :  1  £  i  PK  which  at  first  is  appealing  if  one  wishes  to  consider 
generalizing  to  problems  having  simple  bounds,  such  as  nonnegativity  constraints,  or. 
sonie  components  of  x)  ,  then  computing  the  optimal  s  amounts  to  solving  a  quadrat 
programming  problem  with  particularly  simple  constraints.  Unfortunately,  if  H  has 
some  negative  eigenvalues,  then  this  quadratic  programming  problem  can  have  a  number 

of  local  minima  (as  many  as  2^) , 

k  k 

Even  when  H  is  positive  definite,  it  is  often  possible  to  compute  s  moro 

rapidly  or  more  accurately  if  II  •  II  in  (3.10)  is  the  2-norm  (i.e.,  II  y  !1  =  H  V II2  ~ 

T  1  '  ***  k  k  k 

(y  y)  In  this  case,  any  s  that  minimizes  q  (s)  subject  to  s  c  T 

satisfies 

(3.11)  (H*^  +  =  -V^(x'^)  , 

k  k  k  k  2  k 

where  ^  0,  makes  H  +  D  positive  semidef  inite ,  and  X^  =  0  if 

k  k  k 

II D  s  IL  <  6  (see  lGay79)).  In  practice,  if  X  is  positive,  then  it  usually  cannot 

2  * 

be  determined  exactly,  and  attempting  to  approximate  it  with  high  accuracy  would  be 

inefficient.  Good  performance  is  usually  obtained  in  this  case  if  s  is  chosen  to 

k  k  k  k  k 

satisfy  (3.11)  with  X^,  replaced  by  any  X  that  yields  a  <5  ^  II D  s  II.,  ^  i  6  for 

some  fixed  a  <  1  and  6  >  1.  (Hebden  [Heb73)  and  More  [Mor78)  choose  a  =  0.^''  and 

6  =  1.1,  while  Dennis  and  Schnabel  [DenS79)  suggest  a  =  0.75  and  f  =  1.5.  In  ti-.e 


context  of  NL2S0L  (DenGW791,  we  felt  that  the  former  choice  gave  slightly  better 

performance  than  the  latter.  )  An  acceptable  X  can  usually  be  obtained  in  one  or 

two  tries  using  an  iteration  proposed  independently  by  Hebden  (Heb73]  and  Reinsch 

[Rei71)  together  with  Morel's  variation  [Mor781  of  Hebden's  safeguarding  scheme. 

k  k  2 

A  simple  way  of  detecting  and  handling  the  exceptional  case  where  H  +  D  is 

nearly  (or  actually)  singular  is  described  in  (Gay 79]. 

k  2  k 

In  cases  where  ^  ^  ^  ^  ),  it  is  possible  to 

2  k 

avoid  computing  .  v  (x  )  explicitly  and  to  work  with  a  QR  factorization  of 
k  1/2 

P(.  "(r(x  )))  A  by  carrying  out  calculations  in  the  way  described  by  Mor^  [Mor78] . 

The  discussion  above  and  in  Appendix  A  about  carrying  out  a  QR  factorization  in 
connection  with  (3.5)  applies  here  to  the  initial  QR  factorization  (for  each  k 

where  p*'(r(x  ))  has  one  or  more  components  near  zero). 

k+1 

Let  us  now  consider  the  new  trust-region  radius  6  .  Various  choices  for 

k+1 

£  have  been  suggested,  such  as  those  in  [Pow70b,d)  ,  (Heb73]  ,  (Mor781  ,  (E)enGW79]. 

k+1  k  k  k+1  k  k 

They  generally  have  the  form  6  =  p  6  or  6  =  u  IlD  s  II  ,  and  the  decision 

k  k 

whether  to  choose  p  <  1  or  u  >1  is  generally  based  on  whether 

(3.12)  (x”^)  -  (x*'  +  s*^)  <  Cj^  [q*' (0)  -  q*'(s*')) 

for  some  fixed  c^^.  (Usually  Cj^  =  0.25  or  0.1,  and  sometimes  <  is  changed  to 

£  in  (3.12).)  When  (3.12)  holds,  Mor^  (Mor78]  chooses 

(3.13)  =  maxfO.l,  min{0.5,  9*}}  , 

where  6*  minimizes  the  quadratic  polynomial  that  fits  y(0),  y'  (0),  amd  yd)  for 

(3.14)  yO)  =  r-(x’'  +  9  s’')  , 

i.e.  ,  S*  =  j  y'  (0)/ty '  (0)  +  y(0)  -  y  (1) )  (for  yd)  ^  y  (0)  +  y’  (0) ) .  (This  idea 

for  9*  stems  from  [Fie 71].)  Hebden  [Heb73]  also  uses  (3.13),  but  his  6*  minimizes 

the  cubic  polynomial  that  fits  y(0),  y'  (0),  y"(0),  and  yd),  i.e.  9*  = 

2  k  k  k  k 

[-y"  (0)  +  r  (y"(0))  +  2ny'  (0)],'ri,  where  n  =  6  [.y  (x  +  s  )  -  q  (s  )].  Both  schemes  seem 
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reasonable  for  the  problems  at  hand.  For  cases  where  (3.12)  fails,  many  choices 
for  .  have  been  proposed.  A  new  one  in  the  spirit  of  (3.13)  is 

(3.15)  -  max\l,  min  {4,  0*1}  , 


where  •*  has  one  of  the  values  just  described.  The  relative  merit  of  (3.15) 
remains  to  be  seen. 

In  cases  where  X  >0,  it  may  be  reasonable  to  replace  (3.14)  by 


(3.16)  y(6)  =  v-(x’‘  +  s(9))  , 

k  k 

where  s  O)  minimizes  q  (s)  subject  to  IlDsIl^  ^  oil  D  s  ll^-  Since 
s'  (0)  =  IlD  S^l2  7  y(x'^)/llD'^  V  v'(x’^)|l2,  the  >  of  (3.16)  has 
(0)  =  -IlD  ^  V  s;(x*^)ll2  ■  IlD  s*^ll  2 ,  and  since  s"(0)'^  7  ^:(x'^)  =  0, 

,"(0)  =  IlD  s^l^  IVv(x’^)'^  d'^  7^  v(x'‘)  •  d”^  V  v'(x’‘)]/IId'^  V  v'(x’^)ll2.  Note  that 
k  k  k 

,  (0)  =  y  (x  )  and  y(l)  =  ■;  (x  +  s  )  are  readily  available. 

)c+l 

In  addition  to  worliing  well  in  practice,  choices  for  i  of  the  sort  just 

discussed  ma)ce  it  possible  to  prove  convergence  theorems.  Indeed,  it  is  easily  shown 
)c  2  )c  k 

that  if  H  =7  ^  (x  )  and  some  iterate  x  comes  close  enough  to  a  strong  local 

minimizer  x*,  i.e.,  a  point  where  V,;(x*)  =  0  and  7^  ;(x*)  is  positive  definite, 

then  the  iteration  soon  reduces  to  Newton's  method  and  the  iterates  converge 
Q-quadratically  to  x*. 

Let  us  briefly  consider  what  can  be  said  more  generally  about  convergence. 

Powell  has  studied  a  large  class  of  model/trust-region  algorithms  in  which 

k  k+1  k  k  k+1  k 

c_  i  ^  ■  <  c,  f  when  (3.12)  holds  and  i  <  5  <  c.  S  otherwise  (with 

2  —  —  3  —  —  4 

*^3^^  and  c^  i  I'-)!  ^nd  in  which  H  can  be  any  symmetric  matrix  in 
^  (such  as  7^  y(x*^)  or  an  approximation  thereto  produced  by  a  secant  update) 


that  satisfies  the  iDOunded 

If 


,k  ,  .max  ,  , ,  , 
i  '  ^  for  all  k 


X 


yll 


.max 


k  ^  i 

deterioration  condition  II H  II  <  c..  +  c,  )  II  s  II . 

-  5  ^  i=0 

and  if  ^  is  bounded  and  uniformly  continuous  on 
for  some  y  with  ■^(y)  ^  {x  ))  (which  is  easily  seen 
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to  be  the  case  for  (2.1-6)),  then  Theorem  1  of  [pow75)  asserts  that 

(3.17)  lim  inf  Il7v'(x’^)ll  =  0  . 

k  O' 

k  k+1  k 

If  the  rule  for  updating  x  is  changed  so  that  x  =  x  if  (3.12)  holds  and 

k + 1  k  k  k 

X  =  X  +  s  otherwise,  then  (3.17)  is  easily  strengthened  to  lim  7y' (x  )  =  0. 

k-^ 

0  12 

Let  X*  denote  the  set  of  limit  points  of  the  sequence  x  ,  x  ,  x  ,  •  •  *  generated 

using  the  modified  x  updating  rule.  Continuity  of  7y  in^lies  7y''(x*)  -  0 

for  all  X*  e  X*,  and  if  X*  contains  just  one  point  x*,  then  lira  x  =  x*.  If 

k  2  k  2 

H  =  V  y(x  )  and  X*  contains  raore  than  one  point,  then  7  v(x*)  is  positive 

serai  definite  and  singular  at  each  x*  c  X*.  Although  pathological  cases  where  X* 

contains  two  or  more  points  are  extremely  unlikely  in  practice,  I  know  of  no  easy 

way  to  exclude  them  in  theory.  So  far  as  the  rule  for  updating  x  is  concerned,  it 

k+1  k  k  k  k  k 

may  be  better  in  practice  to  set  x  =  x  +  s  whenever  ■^(x  +  s  )  <  sJ  (x  )  . 

since  (in  our  experience)  this  leads  to  faster  convergence  more  often  than  not. 

When  iterative  methods  are  used  to  solve  a  problem,  the  overall  time  required  to 

find  cin  acceptable  solution  may  be  less  for  a  more  slowly  convergent  method  them  for 

a  more  rapidly  convergent  one  that  requires  more  work  per  iteration.  One  way  to 

reduce  the  work  per  iteration  in  the  above  model/trust-region  approach  is  by  some- 

k  k-1  k-1  k 

times  choosing  H  to  be  H  or  a  cheaply  updated  version  of  H  .  If  H  is 

2  k  k-1 

periodically  chosen  to  be  V  •^(x  )  and  is  otherwise  set  equal  to  H  (which 

Traub  suggested  in  §11.3  of  [Tra64]),  then  it  is  possible  to  use  Brent's  ideas  lBr«'73) 

to  optimize  the  asynptotic  efficiency  of  the  iteration  once  it  has  reduced  to  (3.6) 

with  =  1.  In  practice,  it  might  be  better  to  use  em  idea  of  Todd  [SaiT78)  ■ 

k  2  k  k 

periodically  choose  H  to  be  V  ^  (x  )  and  otherwise  obtain  H  by  performing  a 

secant  update  on  A  similar  idea  is  possible  with  given  periodically  by 

(3.8).  At  this  point  it  is  not  clear  just  which  combination  of  the  ideas  presented 

in  this  section  will  minimize  the  time  needed  to  find  an  acceptable  local  minimi  zer 


of  (1.1)  when  r(x)  has  the  form  (3.1). 


4.  Nonlinear  Problems 


In  this  section  we  generalize  the  discussion  in  53  to  the  case  where  the  residual 
vector  r(x)  is  a  twice  differentiable  function  of  x.  In  place  of  (3.2)  and  (3.3), 
we  then  have 

V>p(x)  =  J(x)^  b'  (r(x))  and 
2 


(4.1) 

(4.2a) 

(4.2b) 


^(x)  =  G(x)  +  S(x)  where 


G(x)  =  J(x)  P(p"  (r  (x)  ) )  J  (x)  and 


(4.2c) 


S(x)  =  y  p!  (r,  (x) ) V  r.  (x) 
.,11  1 
x=l 


Aside  from  direct  search  algorithms  (i.e.,  pattern  searchers,  such  as  the  method 
of  sinplices  [NelM65]),  most  minimization  algorithms  require  a  good  approximation  to 
the  gradient  of  the  objective  function.  It  is  usually  possible  to  obtain  an  acceptable 
approximation  by  finite  differences.  For  the  objective  function  v"  of  present 
concern,  it  is  slightly  cheaper  to  confute  a  finite-difference  approximation  J(x,  h; 
to  J  (x)  and  then  coitpute  the  approximate  gradient 

(4.3)  7‘^(x,  h)  =  J(x,  h)^  p'(r(x)) 

than  it  is  to  approximate  V.;(x)  directly  by  finite  differences,  since  the  criterion 

functions  need  only  be  evaluated  at  r(x)  when  (4.3)  is  used.  Thus  it  is  reasonable 

to  assume  that  a  good  approximation  to  J(x)  is  available,  and  to  siiT5>lify  the 

following  notation  and  discussion,  we  henceforth  assume  that  J(x)  itself  is  available. 

2 

Because  of  S(x)  in  (4.2),  algorithims  explicitly  using  7  v'(x)  would  be 
difficult  to  implement  and  ejqjensive  to  run.  On  prcblems  where  S(x)  is  li)cely  to  be 
small  at  the  desired  solution,  it  is  reasonable  to  discard  S(x)  and  thus  replace 

2  )t  )c  )t 

.  ‘^(x  )  by  its  Gauss-Newton  approximation  H'  =  G(x  ).  Using  this  H  in  the 

methods  of  53,  we  obtain  the  Gauss-Newton  analogs  of  these  methods. 

)( 

For  many  nonlinear  problems,  G(x  )  gives  only  a  poor  approximation  to 
_2  )c  )c 

V  ^  lx  ) ,  and  tjetter  performance  may  well  be  obtained  if  H  is  generated  by  a 
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sec;\nt  update  method  [DenM77].  However,  since  the  Gauss-Newton  part  G(x  )  of  the 
2  k 

true  Hessian  7  v  (x  )  is  readily  available,  it  seems  reasonable  to  approximate  only 
the  expensive  part  S(x  )  of  (4.2)  by  a  secant  update  scheme.  This  works  well  in 
the  particular  case  of  nonlinear  least-squares  lDenGW791,  and  it  is  natural  to  expect 
that  the  obvious  generalizations  sketched  below  of  the  techniques  recommended  in 

[DenW78]  and  [DenGW791  may  prove  worthwhile  for  other  problems  of  the  form  (1.1). 

k  k 

Let  us  first  consider  how  to  update  a  current  approximation  S  to  S(x  ). 
k  k+1  k  k 

Suppose  step  s  has  been  taken  (i.e.,  x  x  +  s  ) .  As  in  [i:)enGW79],  it  seems 

k+i  V  k+l  k 

reasonable  to  require  that  S  s  =  S(x  )s  ,  and 

k+1,  k  ’v  ,  ,  ,  k+l,,„2  ,  k+1,  k 

S(x  )s  =  >  p;  (r.  (x  ))  V  r.  (x  )s 

i=l  "  ^ 


=  I  pMr^  (x’^'^^))  [V  r^(x*‘'^^)  -  V  r^  (x’^)  ] 
i=l 

T 

r^,  k+1  ,,k'j  k+1,, 

=  |^J(x  )  -  J  (x  )J  p '  (r  (x  ) )  , 


k+1  k  k  r  k+1,  k.V 

S  s  =  y  :=  (x  )  -  J  (x  )  J 

„k+l 


p'  (r  (x'^^^) ) 


so  we  shall  require 

(4.4) 

While  there  are  many  ways  of  obtaining  an  that  satisfies  (4.4),  the  reasoning 

in  §3  of  [DenGW79]  and  in  [Gay 76]  suggests  using 

(4.5a) 

(4.5b) 


,  k  _  ,  k+1.  „  ,  k. 

A  g  :=  Vs!  (x  )  -  V<f  {x  ) 


k  ,  k  ,,  ,  k.  T  k, 
V  :=  A  g  /[  (s  )  A  g  ) 


(4.5c) 

(4.5d) 

(4.5e) 


k  k  k  k 
z  :=  y  -  S  s 


k  k  1  , ,  k.T  k  k 
w  :=  z  -  J  [  (z  )  s  ]v 


_k+l  k  k  k, T  k  k . T 
S  :=  S  +  V  (w  )  +  w  (v  ) 


)c  T  Jc  }c  )c 

at  least  when  (s  )  A  g  >0.  To  handle  cases  where  s  and  A  g  are  nearly 

orthogonal,  it  seems  reasonable  to  replace  (4.5b)  by 
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[o  if  g  =  0 
(4.6)  V  :=  •{ 

gV  Isign  ( (s^ ) '^  .i  g’')max{  [  (s*^) i  g"^  | , 
otherwise , 

-4  -6 

where  is  a  small  positive  number  (such  as  10  or  10  ). 

For  the  nonlinear  least-squares  problem,  it  proved  worthwhile  to  size  S 


before  updating  it,  i.e.,  to  replace  S*^  in  (4.5)  by  mind,  |t  1  }S  ,  where 
(4.  7) 


k  ,  k.T  lc,,,)c.Tk  )c, 
i  =  (y  )  s  /  ( (s  )  s  s  1  . 


1<  )c 

Multiplying  S  by  t  shifts  its  spectrum  (interval  from  minimum  to  maximum 

k+1 

eigenvalue),  maJcing  it  more  likely  to  overlap  that  of  S(x  ),  which  has  the  happy 
k + 1  k 

effect  of  making  S  small  in  cases  where  S(x  )  is  small. 

labile  working  with  nonlinear  least-squares  problems,  we  tried  several  other 
candidates  for  t  ,  including  Welsch's  proposal  in  [DenW781  ,  and  concluded  that  (4.7) 
looked  best  on  the  problems  considered.  Whether  the  same  conclusion  holds  for  the 
more  general  problems  of  interest  here  remains  to  be  seen. 

Our  experience  with  nonlinear  least-squares  suggests  using  a  model/trust-region 

approach  in  which  there  are  two  models;  the  Gauss-Newton  model ,  in  which 

k  k  k  k  k 

H  =  G(x  ),  and  the  augmented  model,  in  which  H  =  G(x  )  +  S  .  It)  decide  which 

k+2 

model  to  use  in  determining  x  ,  we  found  the  following  rule  adequate:  if 

,,  k,  k,  ,  k  k,  I  I -k ,  k,  ,  k  k,  1 

(4.8)  ;q  (s  )  -  ,;  (x  +  s  )  1  ^  Cg  |q  (s  )  -  V'  (x  +  s  )  I  , 


k  k 

where  q  is  the  current  model  and  q  is  the  alternate  model,  then  retain  the  same 

model  preference;  otherwise  switch  models.  (We  found  c_  =  to  work  well  in  (4.8), 

6 

though  we  did  not  experiment  much  with  this  constant.)  This  is  called  adaptive 
modeling. 

An  in^iortant  practical  detail  is  the  matter  of  deciding  when  a.n  acceptable 
solution  has  been  found.  The  discussion  in  §6  of  [DenGW791  generalizes  readily  to 
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objective  functions  of  the  form  (1.1).  Specifically,  if  1.x*)  =  0  with  '  (r(x*))  ^ 
then  (4.1)  implies  that  p'  (r(x*))  is  orthogonal  to  the  columns  of  J(x*),  which 
suggests  chec)cing  for  cosine  convergence,  i.e.. 


(4.9)  max 


I  I  .  ,  ,  k,  ,T  ,  ,  )c,  1 

:  |p'  (r  (x  ) )  (x  )  j 
Ho'  (r(x’‘)ll2  IIJ.  (x’")!!^ 


:  IIJ.  (X  )ll2  >  tj.  ,  1  1  i  _  pj 


where  J.  denotes  column  i  of  J  and  e,.  >0  is  a  tolerance  used  to  decide  whether 

1  Ji  — 

should  be  regarded  as  a  zero  vector.  Except  for  the  choice  of  the  test 

(4.9)  has  the  advantage  of  being  unaffected  by  how  the  components  of  x  are  scaled. 

On  problems  where  o'  (r(x*))  =0,  a  test  of  the  form  (4.9)  may  fail  no  matter  how 
close  X  is  to  X*,  so  it  is  also  necessary  to  chec)c  for  generalized  residual 
convergence ,  i.e. 


(4.10)  Ho'  (r(x  ))ll  <.  , 

which  unfortunately  is  sensitive  to  the  scale  of  r(x).  To  handle  cases  where  e  and 


are  too  small  for  the  precision  of  the  arithmetic  being  used,  it  is  advisable  to 

)t  k  Ic 

check  for  X  convergence  -  whether  s  is  small  relative  to  x  when  s  is  rejected, 
k +1  k 

i.e.,  when  x  =  x  .  One  way  to  do  this  is  by  checking  whether 


:  1  1  i  1  P  1  , 


where  fl  (•  )  denotes  the  confuted  value  of  (•).  This  X  convergence  test  is 

k  k 

independent  of  the  scale  of  the  conponents  of  x  and  s  ,  but  may  have  trouble  if 

k  . 

x^  *  0  for  some  i.  We  cam,  of  course,  replace  the  componentwise  test  (4.11)  by  one 
of  the  form 


1  fl  (x’'  +  s’' 

1  l"i  ^  "i  1 

(4.12)  llD[fl(x’‘  +  s’')  -  x'']ll  <  Ejj  IID  x’'ll  , 

which  is  scale-invaricint  if  the  scale  matrix  D  is  chosen  properly  and  only  fails  to 
perform  properly  if  x*  =  0.  (We  may  wish  to  change  D  from  one  iteration  to  the 
next  -  see  57  of  [DenGW79].)  It  may  also  te  reasonable  to  consider  a  generalization 
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of  the  variability  convergenct  test  described  in  lDenGW79].  Although  it  may  not 

^  k  2 

always  be  clear  how  to  generalize  the  scale  factor  ^  r.  (x  )  /max  i,  n  -  p  used 

i=l 

for  nonlinear  least-squares,  it  is  clear  that  the  alternate  interpretation  of  the 
variability  convergence  test  in  the  case  of  a  full  Newton  step  generalizes  to  a  test 
on  the  predicted  change  yet  possible  in  the  objective  function.  TO  handle  cases 
where  s^  is  not  a  Newton  step  Ci,e.,  s  f  (x  )),  it  may  be  best  to  check 

whether,  say, 

(4.13)  ‘^(x^)  -  min{q*' (s*')  ,  (x^  +  s^)}  £  •  |s:(x*')|  , 

which  if  true,  ndght  be  called  function  convergence.  For  reasonable  values  of 
it  is  li)cely  that  (4.13)  would  be  satisfied  sooner  than  (4.9-12). 


Ac)cnowledqiwent 

I  than)c  Roy  E.  Wtelsch  for  some  helpful  discussions  and  for  useful  comments  on 
a  draft  of  this  paper. 
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1/2  “  “1/2 

Appendix  A:  Confuting  (D  A)  (D _ ^ 

In  this  appendix  we  consider  in  more  detail  a  problem  in  numerical  linear 
algebra  from  :  given  A  e  of  rank  p,  be  k”  ,  and  D  =  diag  ,  ’  *  *  » 

with  d^>0,  l^i^n,  compute  a  good  approximation  s  to 


(A.l) 


-T  --1-T- 
S*  =  (A  D  A)  A  b 


using  finite-precision  arithmetic.  Let 


(A. 2a) 
(A. 2b) 


1/2  - 

A  =  D  A  and 
b=D-^/^b  , 


T  “1  T 

SO  that  (A.l)  amounts  to  s*  =  (A  A)  A  b.  How  easily  we  can  compute  a  good  s 
depends  largely  on  the  condition  numJaer  <  of  A,  i.e.,  the  ratio  of  largest  to 
smallest  singular  value  of  A,  which  is  given  by 


(A.  3) 


r  T  T  -1 

=  II A  AIL  II  (A  A) 


'2  .  2 

1 

Let  fl(-)  denote  the  result  of  confuting  (•)  in  the  available  finite-precision 
arithmetic.  We  assume  this  to  be  floating-point  arithmetic,  so  that  if  o£^  denotes 
one  of  the  four  elementary  arithmetic  operations  and  a  og^  8  is  defined  and  in  range, 
then  fl  (a  og^  S)  =  (a  og^  3)  (1  +  n)  for  some  n  with  |  n  |  ^  ^MACH'  "MACH 

"machine  epsilon")  depends  only  on  the  arithmetic  being  used.  (If  binary  floating-point 
arithmetic  having  t  bits  in  the  fraction  is  used,  then  ^^^ACH  generally  2 
or  2^  depending  on  whether  results  are  rounded  or  truncated.  It  is  often 
satisfactory  to  define  smallest  positive  floating-point  number  such 

that  fid  .  >1  and  fl  (1  -  c^^„)  <  1.) 

-1/2 

If  <  is  sufficiently  less  than  then  the  most  efficient  way  to  compute 

-•p  « 

an  acceptable  approximation  s  to  s*  is  to  explicitly  confute  f  1  (A  D  A)  and 

—  P*  —  —'p  —  — T  — 

solve  f  1  (A  D  A)s  =  fl  (A  b)  by  computing  a  Cholesky  factorization  of  fl  (A  DA).  A 

roundoff  analysis  of  this  procedure  leads  to  a  relative  error  bound  analogous  to 

2 

(A.  10)  below  (provided  that  <  c  is  not  too  large), 

MACH 


••if  •Ift’ mJ*'  ^ 
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Unfortunately,  people  often  overspecify  their  models,  which  can  easily  lead  to 

-1/2 
^  MACH 


-T  - 

.  In  this  case  it  is  possible  for  fl  (A  DA)  to  be  indefinite  or  singular 


-1/2 

and  the  procedure  just  sketched  can  break  down  or  yield  a  poor  s.  If  ^  Siach' 


-1 


then,  as  indicated  in  3  3,  it  is  usually  possible  to  obtain  a  much  better  s  by 

1/2  - 

properly  computing  a  QR  factorization  of  A  =  D  A  and  approximately  solving 
Rs  =  b.  Vfe  now  consider  how  to  properly  confute  a  QR  factorization  in  cases 
where  some  d^  may  be  near  zero. 

In  general,  if  A  £  ]R*^  ^  and  the  columns  of  A  are  readily  available,  it  is 
common  practice  to  compute  a  QR  factorization  of  A  by  means  of  Householder  transfor¬ 


mations  (elementary  reflectors).  In  this  case  Q  amounts  to  the  first  p  columns  of 

k  .....  K.  2,t  k  ,  k,T 


1  2 

the  product  Q  Q 


0^,  where  Q  =  I 


2  (li  uN  p  u‘'(u'')‘,  T  = 


if  T  ^  0 


0  if  T  =  0 


for  T  e  K,  and  €  b"  is  chosen  so  that  ^  := 


Q  A  has 


k+1  1 

A  =  0  for  i>  j,  lljlk.  Starting  from  A  :=  A,  this  is  accomplished  by 
1 1 D 


choosing 


0  if  i  <  k 


k  /  k  .  ,  k  , 

"i  =  \,k  ^ 


I  (a5,)^ 

j=k  . 


1/2 


if  i  =  k 


a'^  ,  if  i  >  k 
1  ,k 


T  P"*"! 

with  this  scheme,  Q  b  is  the  first  p  components  of  the  b  confuted  by 


b’-  :=  b,  b"^^ 


b*^  -  2(llu*'ll2)  t(u*')^  b'^]u*^  for  1  ^  k  ^  p.  Unfortunately,  if 


d^  =  0  for  some  k  <  p,  then  this  scheme  can  lead  to  severe  cancellation  errors  in 
components  k  through  p  of  fl  (Q^  •  •  •  ^  b  ) .  Si^pose  for  exanple  that 


/  1 


.01  2 

1  1 


(A. 4) 


A  = 


.01 

.01  \ 

i 

2 

1  ' 

and  b  =  j 

^  1 

1  / 

\ 

-19- 


If  we  use  3  decimal  rounded  floating-point  arithmetic  “  .005)  and  obtain  s 

}c  k 

from  a  QR  factorization  of  A  conputed  as  above  (using  division  by  fl  (u^^  ' 

k  -2  k  k  -1 

in  place  of  multiplication  by  211  u  il^  =  ^  k^  then  we  get 

/  3.53  ■  _  2.51226  \ 

s  =  ’  1  ,  whereas  s*  =  ,  I  .  If  the  components  of  A,  b,  and  d  had 

V  -.518  \  .4874  39/' 

been  ordered  to  make  d^  the  small  one,  then  the  computed  s  would  have  been  much 
more  accurate.  For  example,  if  we  interchange  rows  1  and  3,  so  that 


(A. 5)  A  =  .01  2  I  and  b  =  1 

'  /  \  ' 

\ .01  .01  '  \  100  ' 

/2.52\ 

then  we  obtain  s  =  1  •  This  illustrates  a  way  of  avoiding  disasterous  cancel- 

V  .490/ 

lation  errors  when  using  Householder  transformations:  we  cound  introduce  row  pivoting, 

k  k  •  k 

i.e.,  before  generating  Q  ,  we  could  arrange  that  |Aj^  1  =  max{  ;A^  k^  '  ^ 

k  • 

by  interchcuiging  two  rows  of  [A  :  b  ]  if  necessary.  (Equivantly,  we  could  use  a 
permutation  vector  in  the  obvious  way.) 

A  QR  factorization  confuted  using  Givens  transformations  (plane  rotations)  can 

also  suffer  disasterous  cancellation  errors  in  some  cases.  With  this  scheme,  Q  is 

k  2 

the  first  p  columns  of  a  product  of  matrices  Q  ’  of  the  form 

k  i 

cos  0'  if  i  =  j  =  k  or  i  =  j  =  2. 
k  2 

sin  9  '  if  i  =  k  and  j  =  2 


Q.  .  =  1  -sin 
1/3 


if  i  =  2  and  j  =  k 


1  if  2/i  =  j?<k 


0  otherwise. 


If  A  and  b  are  given  by  (A. 4),  for  exanple,  and  we  coii5>ute  s  using  the  Givens 
12  13  2  3 

transformations  Q  '  Q  '  Q  '  computed  in  the  same  3  decimal  arithmetic  as  before, 


then  we  obtain  s  = 


2.35N^ 

.65oJ 


The  substcintial  error  in  this  s  comes  about  because 


1 
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using  one  of  the  schemes  sketched  above,  it  is  generally  possible  to  show  that 


(A. 6a) 


A  +  E  =  QR  and 


(A. 6b)  V  +  f  =  b  , 

where  il  E  II  =  ■  II  All)  and  II  f  II  =  ■  llbll),  provided  n  •  is 

small  (e.g.  n  •  c  <  .01).  Wilkinson  [Wil65i  does  this  for  Householder  transfor- 

MACH  — 

mations,  Gentleman  [Gen73)  does  this  for  (standard  and  fast)  Givens  transformations, 
and  BjOrck  [Bjt56  7]  does  this  for  the  stabilized  Gram- Schmidt  process.  Relatively 
little  error  occurs  when  we  compute  s  by  solving  Rs  =  v;  we  could  take  this  small 

error  into  account  by  adjusting  E  and  f,  but  doing  so  would  not  change  the 

character  of  the  bounds  on  E  and  f,  so  we  shall  assume  this  error  to  be  zero. 
Another  source  of  error  with  the  stabilized  Gram-Schmidt  process  is  the  fact  that  t.he 
columns  of  the  computed  Q  are  usually  not  truly  orthonormal.  Since  this  fact  also 
does  not  change  the  nature  of  the  perturbation  bound  (A. 10)  given  below  (see  [Bjii671), 
we  shall  ignore  it  too.  Thus  we  consider  that  the  computed  s  exactly  solves  a 
modified  problem:  s  =  (A  +  E)  ‘  (b  -  Qf )  ,  where  (A  +  E)  =  [  (A  +  E)'^(A  +  E)]'^  (A+E)'^ 
denotes  the  pseudoinverse  of  A  +  E,  which,  like  A,  we  assume  to  have  rank  p. 

In  discussing  the  difference  between  the  computed  s  and  the  desired  one, 

i.e.,  s  -  s*  =  (A  +  E)  (b  -  Qf)  -  A  b,  it  is  useful  to  use  the  notation  for 

X  - 1  T 

orthogonal  projection  onto  the  column  space  of  A  :  P^  =  A  A  =  A  (A  A)  A  .  In 
this  notation,  I  -  P^  denotes  (orthogonal)  projection  onto  the  orthogonal  complement 
of  the  column  space  of  A. 

When  A  and  b  come  from  a  general  linear  least-squares  problem, 

II  (I  ■  often  not  too  large,  in  which  case  the  perturbation  bounds 

that  Stewart  derives  in  [Ste691  are  quite  satisfactory:  Theorem  6.2  of  [Ste69)  shows 

for,  say,  k.  IIP,  Ell /II All  <  1/2  that 
A  — 


(A.  7) 


II  (A>E)%-a'  bll  "^a"".  2"<'-'’a>'^"  "(I-Pa)EII  3ll(I-P^)Ell  _ 

^  *  II  .  II  t  K  •  II  „  ,_ll  •  II  R  II  +  <  *  _  )  , 


II A  bll 


I  A  II 


IIP,  bll 
A 


II  Air 


where  <  is  given  by  (A.  3).  When  A  and  b  come  from  (A. 2),  however, 


-22- 


II  (I  -  P  )bll/IIP  b  II  and  therewith  the  right-hand  side  of  (A. 7)  can  be  arbitrarily 
A  A 

T 

large.  Fortunately,  the  left-hand  side  of  (A. 7)  only  depends  on  A,  E,  A  b,  and 
b.  Indeed,  we  prove  in  Appendix  B  that  if,  say,  k  II  Ell /II A  II  £  1/2,  then 


(A.8)  “  ^  ^  ^  .  0(K^ 


llA  bl 


I  Ell  _IIe'^  b 

I  A  II  II  A 


P  b  I 
A 


~  2  II  (A  +  E)'^  b  II 

II A  111  II A  II  IIP,  bll  ' 


It  is  straightforward  to  modify  the  derivations  in  lBj667]  to  show  that 


(A. 9a) 


°n,p<^MACH  • 


(A.  9b) 


If  II  =  0 

n  ,p  MACT 


lb  I/ll  All  ) 


for  the  stabilized  Gr am- Scjiini dt  process,  where  a  e  ]R  has  ith  component 

a,  =  max{|A.  .[  :  1  1.  j  ^  P }  [b  |  e  Ir''  has  ith  component  jb.  1-  Together  wit 

i  1 ,  j  1 

(A. 6)  and  (A.8),  this  may  be  used  to  show  that 


(A. 10) 


(A  +  E)  (b  -  pf )  -  A  b  I 

II  A*  b  II 


n,p 


r 


a'^lbi 


MACH 


iP  b 
A 


+  1 


(for,  say,  <  •  P)  11/2). 


While  I  suspect  that  (A. 9)  usually  holds  when  Householder  or  Givens  transforma¬ 
tions  with  row  pivoting  are  used  to  compute  a  QR  factorization,  at  this  writing  I 
am  unable  to  obtain  satisfactory  bounds  for  these  procedures. 
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Appendix  B;  Bounding  (A  +  E)  b  -  A  b 

In  this  appendix  we  may  gain  some  generality  by  allowing  A,  E,  and  b  to 


complex  components.  Ihus  we  assume  that 


b  .  a:"'P,  E  c  i" 


and  that  A 


rank  p.  Superscript  H  stands  for  conjugate  transpose,  and  •  li  denotes  the  Euclide^ 

H  1/2 

vector  norm  11x11=  II  xll  ^  :=  {x  x)  '  or  the  corresponding  induced  matrix  norm: 

IIaH  :=  maxdlAxll  :  11x11  =  1}.  As  before,  <  denotes  the  condition  number  of  ,A,  i.e., 

K  =  [IIa*^  All  II  (A^  A)  ^11]^^^,  A  =  (A^  A)  ^  A*^  is  the  pseudoinverse  of  A,  and 

=  A  A  denotes  the  operator  that  projects  orthogonally  onto  the  column  space  of  A. 

It  will  be  convenient  to  use  the  notation 
(B.la)  S  :=  <  II  Ell /II  All 

(B.lb)  :=  K  II  Ell /II  All 

It  will  also  be  convenient  to  exploit  the  fact  that  A  has  a  QR  decomposition: 

A  =  QR,  where  Q  e  has  orthonormal  columns  (i.e.  ,  Q^  Q  =  I)  and  R  ^  (C^"^ 

is  upper  triangular  with  II  rII  =  II  All  ,  II R  II  =  II A  II  =  k/IIaII  .  Note  that  P^  =  Q  Q  . 

Theorem  B.  3  below  rests  heavily  on  Lemma  B.2,  which  in  turn  relies  on  the 
following  simple  lemma: 

Lemma  B.l  Assume  F  ^  and  let  =  P^  F.  If  IlF^II  <  1,  then  (Q  +  F)”(Q  +  F) 

is  nonsingular  and 

4llF.l/  +  IlFlI^ll  +  2llF  I!) 

(8.2)  II  I  -  Q  F  -  F  Q  -  t  (Q  +  F)  (Q  +  F)  ]  II  <_  - - - - - i -  . 

(1  -  IlF^II)^ 

Prop  f :  For  any  x  c 

II  (Q  +  F)xll  ^  IIq”(Q  +  F)xll  =  II  (I  +  q“  F^)xII  >  (1  -  II  F^ll  )llxll  , 
so  the  smallest  eigenvalue  of  (Q  +  F)^(Q  +  F)  is  at  least  (1  -  IlFj^ll)^,  whence 


(Q  +  F)  (Q  +  F)  is  nonsingular  and 


KQ  +  F)”(Q  +  F)l"^ll  1  (1  -  IlFj^ll)"^ 


Now  [  (Q  +  F)”(Q  +  F)  )  [I  -  q”  F  -  f”  Ql  =  I  -  (q”  F  +  F*^  Q)  (j”  f  +  V"  ■;;;)  ♦ 

+  f“  F(I  -  F  -  f“q),  so 

I  -  2”  F  -  Q  -  [  (Q  +  +  F))*^  = 

(B.4) 

=  I  (2  +  F)“(2  +  F)l'^[-(2“  F  +  f”  2)  (c”  f  +  f“2)  +  f“  f(i  -  2”  f  -  f“  2))  . 

and  (B.2)  follows  readily  from  (B.3),  (B.4),  and  the  fact  that  Il2^  F  ii  =  IIf'^  2  = 

I^mma  B.2  Let  F  =  E  r“^,  F^  =  F,  and  assume  ^  I-  Then  llF^ll  ^  1, 

(A  +  E)^(A  +E)  is  nonsingular,  and  if  is  the  orthogonal  projection  onto  the 

column  space  of  A  +  E,  i.e.,  =  (A  +  E) (A  +  E)  ,  then 


-  PJi’"  <  IIf”(I  -  2  2“)bll  +  II F,  II  II  f“  bll  +  [211  Fll  llF,ll  + 

A  Art,  A  ”  1  1 


(1  +  II  Fll  )  (411 F  11^  +  IlFll^d  +  2llF  ID) 

- - - 5 - - - ]  11(2  +  F)  bll 

(1  -  IlF^II) 


-  ilAf  [IIe”(i  -  P^)b  II  +  He”  b  II  + 

[1  +  Sl[4  +  2  S,  S^l 

+  (2  SB  +  - - - 5 - - - )  II  (A  +  E)  bll] 

[1  -  6^1 

Proof:  We  have  IIF  II  =  IIP  E  r'^II  <  IIr'^II  ||P  E  H  <  k  Up  Ell/ll  All  =  S,  ,  so 
I  A  —  A  —  A  1 

(2  +  F)^(2  +  F)  and  (A  +  E)^(A  +  E)  =  B^(2  +  F)^(2  +  F)R  are  nonsingular  by 


Lemma  B.l. 

Let  M  =  [  (2  +  F)“(2  +  F))"^  -  [I  -  2“  F  -  f“  21-  Since 

Pa+e  ^  (Q  +  F)((2  +  F)"{2  +  F)]"^(2  +  F)”,  we  have 

Pa+e  ■  '’a  °  (Q  +  F)  [I  -  2”  F  -  f“  21  (Q  +  F)”  -  2  q”  +  (2  +  f).m(2  +  F)“ 

(B.6) 

=  [I  -  2  q”if(2  +  f)”  +  2  f”(i  -  2  2”)  -  2  f“  2  f“  + 

+  t  (2  +  F)M  -  F  (2”  F  +  f”  21 1  (Q  +  F)”  . 

Now  II2II  1,  P^  “  2  2  ,  and  P^IT  "  P^l  ~  first  inequality  in  (B.5) 
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follows  from  (B.6)  aiid  Lemma  B.l  (which  bounds  II  Mil  ) .  As  above,  we  have 


and  similarly  IIfII  ^  ►:  II  Ell /I!  All  =  (see  (B.l));  because  of  these  inequalities  and 

the  facts  that  -  Q  Q“)bll  ^  f-  II  (I  -  P^)bll/llAll  and  II  (Q  +  F)^  bll  ^ 

V  H 


I  (A  +  E)  bll/llAll,  the  second  inequality  in  (B.5)  follows  from  the  first. 
We  may  now  prove  the  main  result  of  this  appendix: 


Theorem  B.  3:  If  <  1 ,  then 


II  (A  +  E)  b  -  A  b  I 


lA  b  I 


-U  -  \  A 


/II  P,  E  II  +  kII  E  I 
A 


II  All  11  bll 
A 


(B.7) 


11  +  SI  [4  S^  +  +  2  6^  6  ) 

+  (266j^  +  - - - - - ^)ll  (A  +  E) 


(1  - 


[(1  +  Bj^)IIe“  bll  + 

%,])  . 


Proof:  From  (4.8)  of  [Ste69]  we  have 


{B.8) 


(A  +  E)  b  -  A  b  =  (A  +  P  E)  (P^  (P^  ^  -  P^)b  +  P  E  A  b]  . 

A  A  A+E  A  A 


By  (6.5)  of  [Ste69),  II  (A  +  P^  E)  II  ^llA  11/(1  -  Sj^)-  This  combines  with  (B.8), 


Lemma  B.2,  and  the  facts  that  k  =  II A  II  II A  II  ,  p^  =  a  A  ,  and  II  A'  b  II  ^  II  P^  bll /II  All  to 


give  (B.7). 
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nonlinear  function  of  x.  We  study  this  structure  and  how  to  use  it.  This  provides 
an  opportunity  to  discuss  some  ideas  applicable  to  general  unconstrained  optimization 
and,  in  particular,  to  point  out  the  advantages  of  a  roodel/trust-region  approach.  For 
nonlinear  r  (x) ,  we  recommend  generalizations  of  some  techniques  that  have  proven 
worthwhile  on  nonlinear  least-squares  problems  in  which  the  optimal  residual  vector 
r{x*)  may  be  either  large  or  small.  Appendices  consider  the  numerical  linear 
algebra  involved  in  computing  a  trial  change  to  the  parameter  vector  x  and  give  a 
new  perturbation  theorem  on  linear  least  squares  that  is  useful  in  analyzing  the 
numerical  behavior  of  the  procedure  recommended  for  computing  this  trial  change. 


